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Abstract —This text investigates relations between two well- 
known family of algorithms, matrix factorisations and recursive 
linear filters, by describing a probabilistic model in which 
approximate inference corresponds to a matrix factorisation 
algorithm. Using the probabilistic model, we derive a matrix 
factorisation algorithm as a recursive linear filter. More precisely, 
we derive a matrix-variate recursive linear filter in order to 
perform efficient inference in high dimensions. We also show 
that it is possible to interpret our algorithm as a nontrivial 
stochastic gradient algorithm. Demonstrations and comparisons 
on an image restoration task are given. 

Index Terms —Matrix factorisation, Recursive least squares, 
Kalman filtering. 

I. Introduction 

ATRIX factorisation algorithms are one of the corner¬ 
stones of modern signal processing, machine learning, 
and, more generally, computational linear algebra. Formally 
the problem can be stated as factorising a data matrix Y £ 

R mxn as, 

Y ~ CX (1) 

where C £ ®> mXT ' is the dictionary matrix , and columns of 
X £ R rxn are coefficients , and r is the approximation rank. 
In our setup, all matrices will be real-valued. We are interested 
in to solve this problem in a recursive way, i.e. using a single 
data vector at each time to update factors. 

This is a well-known and well-studied problem. On the 
matrix factorisation side, the following works are related to 
our work. In Q], authors proposed a sequential Monte Carlo 
based nonnegative matrix factorisation (NMF) algorithm using 
a similar model to original probabilistic interpretation of NMF 
G). The model proposed in JT] is defined over columns of 
X, and C is regarded as a static but unknown variable, so 
estimated via maximum-likelihood techniques. In 0], authors 
propose a dynamic matrix factorisation with collaborative 
filtering applications in mind. In J4j, authors propose a matrix 
factorisation algorithm based on stochastic gradient descent 
(SGD). In our context, it can be applied column-wise. In (5), 
authors derive an online dictionary learning algorithm which is 
also related to SGD but they also impose sparsity assumptions 
on coefficients. An approach based on recursive least squares 
(RLS) can be found in |6). Also there is more recent work 
on dictionary learning based on RLS Q. These RLS-based 
approaches factorises the dictionary matrix (e.g. C = AB) 
and update each of these factors accordingly. Moreover these 
RLS-based works focus on supervised learning of dictionaries 
whereas here we are interested in unsupervised learning and 
applications without training phase (such as unsupervised 
image restoration). 


Update rules given in these papers are different than what 
we obtain (we do not update dictionary matrix by factoring it), 
and more importantly, we depart from a certain probabilistic 
model (instead of a cost function ) and derive the update 
rules as explicit inference rules. We derive matrix factorisation 
algorithms as approximate matrix-variate filtering algorithms 
in probabilistic models. The most related works to ours are 
HI and G) in which authors derive matrix-variate update rules 
for Hessian matrices as analytic inference rules in probabilistic 
models to obtain quasi-Newton algorithms from a probabilistic 
perspective. We follow the exact same approach, and our 
derivations follow those works. The model defined in (8j and 
ours are slightly different as 0 uses a model over square 
and symmetric matrices, but we define the model for non¬ 
square dictionary matrices (so we re-derive the update rules), 
and also the model definitions are slightly different. And 
finally, we apply these ideas to matrix factorisation problem. 
The provided probabilistic characterisation opens up many 
possibilities for incorporating further prior knowledge, or 
dealing with nonstationary data in a principled way by putting 
dynamics on the dictionary matrix. 

In the following subsection, we’ll give some identities which 
will be very useful in proofs. In Section [II] we describe our 
generative model for matrix factorisation. In Section [nD we 
derive our algorithm as an estimation and inference algorithm 
in the probabilistic model described in the Section [TT| In 
Section m we describe the relation between SGD based 
matrix factorisation, and our algorithm. In Section [VJ we 
demonstrate our algorithm on an image restoration task. In 
Section ED we conclude. 

A. Some useful linear algebra 

We will be heavily using the following identities from [10], 
in in this paper. Let A is of dimension m x r and B is of 
dimension r x n. Then the following holds, 

vec (AXB) = (B t ® A)vec(X) (2) 

A particular case where this identity will be useful for us is 
when dim(A) = m x r and dirn(H) = r x 1. So let us note 
the particular case in a more useful form to us, 

(a; T <g> Im)vec(A) = vec [Ax) = Ax. (3) 

where Ax is also a vector, dim(a;) = r x 1, and I m is m x m 
matrix. For a matrix M where dim(M) = m x r, let m = 
vec(M) is a mr x 1 vector. To revert this operation, we define 
the reshaping operator: vec“ 1 xr ,(77i) = M. Rronecker products 
also have the following mixed product property, 

(A <8> B)(C ® D) = (AC) <g> (BD), (4) 

and the following “inversion” property, 

(.A <g> B)~ 1 = A -1 <g> B~ x . 
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II. The Probabilistic Model 

Let Y £ l m x n be the data matrix, and C £ R mxr and 
X £ R rxn . Let us denote the z’th column of the data matrix 
Y with and [n] = {l,...,n}. The observations are 

generated in the following way: At time k, we randomly 
sample an index i k ~ [n\. And we set y k = Y(:,i k ). So 
y k denotes the observation at time k but not fc’th column. 
Similarly, the associated column of X is denoted with x k , and 
Xk = X(:,ik). For example if i k = 2, then y k would be the 
second column of Y, and Xk would be the second column of 
X. We denote the dictionary matrix with C, and c = vec(C). 
This also holds for c k = vec(Ck) where C k is to x r matrix 
stands for estimate of C at iteration k. 

In this work, we consider the following probabilistic model, 

p{c) =Af{c;co,Vo®Im) (6) 

p(yk\c,Xk ) = JV(yk; {x k <g> I m )c, A <g> J m ) (7) 

Note that Xk is a static unknown model parameter vector. On 
the other hand, c and yk are random vectors, and treated as 
such. To motivate the model, notice that using identity ((3} for 
(xj ®/ m )c, we can rewrite the likelihood ® in the following 
form, 

p(y k \c,x k ) = Af(y k -,Cxk,\® Im)- (8) 

In the matrix factorisation setup, we would like to assume 
yk ~ Cxk for each k, here this corresponds to assuming 
Gaussian noise. Using the model © and (J7J. we would like 
to estimate both Xk and C given the observations yi-.k, i.e. 
observations up to time k. 


B. Inference: Finding the dictionary matrix 

In this subsection, we assume x k is fixed and x k = x k , and 
we suppress x k from the notation. We consider the model ([6| 
and 0, and solve the posterior inference problem. We can 
rewrite this model in a generic way, 

p(c) = A7(c;c 0 ,P 0 ), 
p{yk\c) = J\f(y k -,H k c,R), 

where -Po = Vo 0 I and R = A (g> I. Since we fix Xk for 
all k , we suppress the x k from the notation, and use generic 
Hk observation matrix which is assumed to be known now. 
Given this model and fixed parameters, it is well-known that 
given observations up to time k, the posterior distribution 
p{c\yi-.k) is Gaussian too lfl2l . We denote this posterior density 
by p(c\yi-.k) = M(c; c k ,P k )- The mean Ck and covariance P k 
can be found by a recursive least squares filter (recursive linear 
hlter) algorithm. Given observations y \ :k , the mean c k is given 
by 02, 

Ck C-k— 1 T Pk-lR-k (P-kPk—lRk Rk) (Vk HkC k — l), 

and the covariance of the posterior is given by, 

P k = Pk -1 - Pk-iHk (HkPk-x Hj + R)- l H k P k -i- 

Implementing these update rules would be very inefficient 
as c £ K mr might be a very high-dimensional vector. This 
requires to store a huge observation matrix H k and a huge 
covariance matrix P k which can easily become an impractical 
problem to solve. But fortunately we can obtain a very efficient 
matrix-variate update rule using the following proposition. 


III. Parameter Estimation and Inference 

From the viewpoint of probabilistic (or Bayesian) inference, 
coefficients Xk are static parameters to be estimated (typically 
by some optimisation formulation), and in contrast, the dictio¬ 
nary matrix C is a latent variable that is to be inferred through 
its posterior distribution. In this section, well show how to 
perform parameter estimation for coefficients, and inference 
for the dictionary matrix. 

A. Parameter estimation: Finding coefficients 

To estimate the parameters x k associated with a given 
observation y k , we formulate the following maximisation 
problem, 

x* k = argmaxp(y fc |c fc _i,iEfc) (9) 

Xk 

Since this density is a Gaussian with mean C k -ix k , the 
solution is the pseudoinverse, 

x* k = (C'J_ 1 C' fc _ 1 )- 1 Cj_ lW . (10) 

Note that in this work, we use this update rule in the ex¬ 
periments. However, just to note, a very intriguing approach 
would be maximising the marginal likelihood p(y k \yi-. k ~i, x k ) 
by integrating out c. Unfortunately, the optimisation part is 
intractable and we will discuss this elsewher^H 

See the discussion at http://almoststochastic.com 


Proposition 1. The posterior mean c k which is given by, 

C k — C k —1 Y P k -\H k (H k P k -lH k “b Rk) ijjk R-kCk— l), 


can be rewritten as, 


n r, , (Vk-Ck-lX^xJV^ 

£k — C-fc_i H - yr— -—- 

X k Vk-lXk + A 


(ID 


Proof We put P k - \ = 14-1 ® Im ( see Prop. 2 to see this 
form holds for all k) and H k = xj ® I m and R k = A <g) I m , 
and arrive. 


Ck — Ck— 1 d" (I4_l 0 dm)(,X k ® Im) 

((x k ® Im )(Vk -1 ® Im )(%k 0 Im) + A 0 Im) X 

(yk ^ Im)ck— l)? 


Using the mixed product property (j4j three times, using 
and finally using ([3]) for the last term, one can arrive, 


Ck — Ck— 1 T 


Yk—lXk 


x k V k -iXk + A 


(jjk Ck— lX k ) 


Use (7] 


Using ([2} and reshaping with vec m x x? ,, we obtain Eq. ( I I I l i. ■ 


One can recover classical Broyden’s rule of quasi-Newton 
methods by setting I4_i = I. It is already known that Broy¬ 
den’s rule and other quasi-Newton algorithms are recursive 
least squares regressors 12, Thus, this is also a generali¬ 
sation of a matrix factorization algorithm we proposed in our 
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earlier work based on Broyden updates 113 . In the following 
proposition, we derive an efficient posterior covariance update 
to use in mean update (fTTT i. 

Proposition 2. The posterior covariance update, 

P k = P fc _! - P k -iHj{HkP^Hj + R)~ l H k P k _ x , 


can be rewritten as, 


Pk 


Vk-i — 


Vk-xXkX^Vk-i 

x k Vk-iXk + A 


0/rr 


V k 


( 12 ) 


Proof. We start by putting H k = xj 0 I m and R = A 0 I m - 
So we arrive, 


P k — P k — 1 P k — l(x k 0 Im)((x k 0 Irn)P k —l(x k 0 / m ) + 

A 0 I m ) (x k ® Im)P k — 1- 


We also put P k -i = V k -i 0 I m . We will show that this form 
holds also P k , and since P 0 is also of this form, by induction, 
we arrive that it holds for all k. Let us put, 

P k = (14—1 ® Im) (Vk— 1 & Im) (x k 0 Im) X 

((x k 0 0 Im)(x k 0 Im) f" A 0 I'm) X 

(x k 0 Im)(V k — l 0 Im)- 

By using mixed product property (O several times, we obtain, 


P k — (14 —1 0 Im) (V k —lX k 0 Im) X 

((xJVk^Xk + A) -1 ®I m )(xJ 14_i 0/ m ). 


where we also used property (0. Few more uses of mixed 
product property leads to, 

n /rr ^ r x Vk-lXkxJVk-1 ^ T 

P k — (14 — 1 ® Im) tt r \ ® Im- 

X^ v k — \X k T A 

Thus we can say that P k = V k 0 I m where. 


14 = 14-t 


V k -ix k xJV k -i 
xJV k -ix k + A 


(13) 


We give the overall algorithm in Algorithm Q] We name it as 
matrix factorisation based on recursive linear filter (MF-RLF). 


C. A variation: Filtering the dictionary matrix 

We define a little modification of the model <(6]) and i(7| and 
obtain a state-space model (SSM), 

p(c 0 ) =J\f(c 0 -,c 0 ,P 0 ) 
p(tt k \c k — i) — ff(c kl c k ~i , Q k ) 
p(y k \c k ) = N(y k \C k x k ,A® Im) 

where now c k variables are latent variables, and c k is the 
posterior mean estimate of the c k . All these quantities are 
again approximate because all of them are conditioned on X 
which is unknown, and to be estimated during the updates. 

Deriving matrix-variate Kalman filtering recursions for this 
model is very similar to what we did in the previous section. 


Algorithm 1 MF-RLF 
1: Initialise Co randomly and set k = 1. 

2 : repeat 

3: Pick y k = Y(:,i k ) where i k ~ [?r] uniformly random. 

4: Perform, 


x k = 

{yk Ck—lXk^Xfc 


Cfc — c k -1 + 
^ - 


A + xJV k -ix k 
V k -ix k xJ 14-1 
xTV k -ix k + A 


5; k — k -f- 1 
6: until convergence 


Algorithmically, it is a simple modification to the AlgorithmH] 
Define Q k = Qv®Im where Qv is rxr covariance matrix. So 
to obtain the matrix-variate Kalman filter, it suffices to perform 
the following step just before step 4 of the Algorithm [TJ 

V k \k-i = 14-i + Qv, 

and use V k \ k -i for updating C k and 14. We think that it could 
be very useful to develop an explicit model when one needs a 
“forgetting” property in the dictionary. It can be a principled 
alternative to what is called “forgetting factor” of the RLS 
when performing matrix factorisations. Qy can be actively 
used to add a dynamic to the dictionary matrix. We leave this 
potential application to the future work. 

IV. Relation to Stochastic Gradient Descent 

Our algorithm can be interpreted as a version of stochastic 
gradient descent with a nontrivial and non-scalar step size. The 
interpretation is given as follows: Suppose y k are iid draws 
conditioned on C (estimating X will be identical to previous 
case, so assume it is known), and we would like to maximise 
the following joint likelihood of the dataset, 

n 

p(Y\C,X)=l[p(y k \C,x k ), 

k=l 

and assume this likelihood is defined as 

p(y k \C,x k ) = J\f(y k -,Cx k ,I). 

Then after a bit of calculation, one can show that applying 
SGD to the negative log-likelihood results in the following 
iteration, 

C k = C k -i + 7 k(y k - C k -ix k )xj. (14) 

First of all, putting y k = 1 /(A + xjx k ) recovers Broyden’s 
rule again from a different perspective: maximum likelihood 
estimation via .SG/H. But note that this does not ensure that 
the usual assumptions on the step-size is satisfied, hence the 
convergence is questionable ||T4l . We note that, the update rule 
(fill) that is proposed in this paper is different than (ITdl) as we 
also have a matrix 14 which can not be embedded into the 
step-size in a trivial way. 

2 There are other ways, e.g. embedding step-size into the covariance. So 
this hints for an interesting connection between the step-size of the SGD and 
posterior covariance of the recursive linear-Gaussian models. 
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Original Images Corrupted Images MF-RLF SGDMF NMF 



(a) (b) (c) (d) (e) 


Fig. 1. Comparison of our algorithm with stochastic gradient descent MF (SGDMF), and nonnegative matrix factorization (NMF). SGDMF and MF-RLF 
passed 10 times over dataset recursively. NMF is run for 1000 batch iterations, (a) Some of original images, (b) We randomly removed %25 batch of all 
columns (for all 400 faces), (c) The result of MF-RLF (Algorithm Q}. (d) Result of SGDMF. (e) Result of NMF. SNR values: MF-RLF: 12.38. NMF: 12.35, 
SGDMF: 11.75 where initial SNR: 0.68. This clearly shows our algorithm competes with online as well as the offline benchmark algorithms on a standard 
task. 


V. Application to Image Restoration 

We demonstrate our algorithm on an image restoration 
task on the Olivetti dataset E). This dataset consists of 400 
face images of size 64 x 64. We vectorise each face into a 
column vector with dimension 4096, so m = 4096 in this 
problem. Since there are 400 faces in the dataset, n = 400. 
We chose r = 40 as an approximation rank and A = 2. 
We initialised factors randomly with¬ 
out imposing any structure. We choose 
Vo = I for this particular dataset, 
other choices lead to poorer perfor¬ 
mance. But it is entirely up to user to 
encode a prior knowledge about dic¬ 
tionary by using covariance matrix Vq 
that encodes a qualitative knowledge 
about the structure between r columns 
of the dictionary matrix. 

We deal with missing data using exact same methodology 
described in ED- So we define a mask M, and denote the 
mask associated with y k with m k . So in the Algorithm [T] we 
replace (y k - C k -iX k ) term by m k © (y k - C k -ix k ). Also 
while updating C k , we construct a special mask, 

M Ck = [m k , ■ ■ ■ ,m k \, 

K y ✓ 

r times 

and apply the following update, 

x k ={(Mc k _ 1 © C k -i) T (Mc k _ 1 ®C k - i)) _1 x 
{M Ck _i © C k -i) T (m k © y k ), 

in the Algorithm Q] for x k . Note that all these reformulations 
can be derived from the model by putting masks into the 
model. We left them out for simplicity. For the more details of 
this missing data handling scheme see EE). We use this both 
for SGD and MF-RLF. 

We give comparisons with both SGD and NMF. We give a 
comparison with NMF because we think that the most basic 
task of an online algorithm is to compete with the state-of-the- 
art batch methods. In general, many online algorithms fail at 
fulfilling this task because datasets which one can experiment 
batch algorithms are too small for online learning. In this 
section, we show that our algorithm fulfils this hard task: It 
works as good as NMF -the standard batch benchmark- on 
image restoration. As our algorithm bears some similarities 


TABLE I 

Table of SNR Values. 
Initial SNR: 0.68 


Algorithm 

SNR 

MF-RLF 

12.38 

SGDMF 

11.75 

NMF 

12.35 


with SGD, we also give a comparison with SGD as an 
online algorithm. The implementation is similar to ours - the 
C k update ( 1 1 41 subsequently followed by pseudoinverse. The 
visual results can be seen from Fig. |T] and SNR values are 
tabulated in Table Q] MF-RLF and SGDMF passed recursively 
10 times over the dataset, i.e. using a single observation each 
iteration. We ran NMF with 1000 batch passes over data. 
This shows these recursive algorithms uses data much more 
efficiently. 

Results show that our algorithm works well perceptually, 
and achieve same SNR values with NMF although it only 
passed 10 times over the dataset. 


VI. Conclusion 

We presented a matrix factorisation algorithm which makes 
use of linear filters. We recast the factorisation problem into 
the linear filtering problem, and propose efficient matrix- 
variate update rules for the Gaussian posterior summary 
statistics. The algorithm can trivially be extended for dy¬ 
namic models on dictionary matrix where one can model 
changing nature of the dataset in a principled way. For the 
future work, we think to extend this filtering approach to 
nonlinear and non-Gaussian state space models where the 
model structure can be much more richer than linear models. 
Putting a nonlinear dynamics on x k poses new challenges 
for sequential inference schemes in high-dimensions, and 
calls for Rao-Blackwellisation of state-of-the-art algorithms 
(such as fl5l ) proposed for high-dimensional filtering. Another 
potential use of our algorithm can be based on uncertainty esti¬ 
mates: Covariance uncertainty can be used to stop unnecessary 
computations, and save enormous time in a related fashion 
to probabilistic numerics m. We hope to pursue different 
methodological and application based directions for the future. 
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